Buying a new house is a tedious and strenuous process. House prices vary with many factors, such as the number of bedrooms, the size of the basement, and more. To make it easier for buyers to hunt for their dream house, we have developed a program that predicts a house's price from its attributes. We consider 24 attributes (the dataset is described during the execution) to predict one variable, SalePrice. This program helps buyers judge whether a particular house is within their budget. Brokers sometimes inflate the price of a house to earn extra commission; with this program, buyers can check whether a broker is quoting the actual price. It can also help real estate brokers sell a property at the right price.
It is our job to predict the sales price for each house. For each Id in the test set, we must predict the value of the SalePrice variable.
This program will help:
1. House buyers
2. Real estate brokers
Let us begin with understanding data.
Here we import all libraries.
from mpl_toolkits.mplot3d import Axes3D
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt # plotting
import numpy as np # linear algebra
import os # accessing directory structure
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
Let us read the dataset.
data=pd.read_csv("C:\\Users\\rmadh\\OneDrive\\Desktop\\Lecture_Notes\\IDS\\Project\\Main_data.csv")
The following steps will help us understand the profile of the dataset.
import pandas as pd
import pandas_profiling
pandas_profiling.ProfileReport(data)
data.head()
| Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | LotShape | Utilities | OverallQual | OverallCond | ... | LowQualFinSF | BsmtFullBath | BsmtHalfBath | FullBath | HalfBath | BedroomAbvGr | KitchenAbvGr | GarageArea | Fence | SalePrice | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1461 | 20 | RH | 80.0 | 11622 | Pave | Reg | AllPub | 5 | 6 | ... | 0 | 0.0 | 0.0 | 1 | 0 | 2 | 1 | 730.0 | MnPrv | 169277.0525 |
| 1 | 1462 | 20 | RL | 81.0 | 14267 | Pave | IR1 | AllPub | 6 | 6 | ... | 0 | 0.0 | 0.0 | 1 | 1 | 3 | 1 | 312.0 | NaN | 187758.3940 |
| 2 | 1463 | 60 | RL | 74.0 | 13830 | Pave | IR1 | AllPub | 5 | 5 | ... | 0 | 0.0 | 0.0 | 2 | 1 | 3 | 1 | 482.0 | MnPrv | 183583.6836 |
| 3 | 1464 | 60 | RL | 78.0 | 9978 | Pave | IR1 | AllPub | 6 | 6 | ... | 0 | 0.0 | 0.0 | 2 | 1 | 3 | 1 | 470.0 | NaN | 179317.4775 |
| 4 | 1465 | 120 | RL | 43.0 | 5005 | Pave | IR1 | AllPub | 8 | 5 | ... | 0 | 0.0 | 0.0 | 2 | 0 | 2 | 1 | 506.0 | NaN | 150730.0800 |
5 rows × 24 columns
Data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database. It involves identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data.
Let us see which values are null. False indicates a value is not null; True indicates it is.
data.isnull()
| Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | LotShape | Utilities | OverallQual | OverallCond | ... | LowQualFinSF | BsmtFullBath | BsmtHalfBath | FullBath | HalfBath | BedroomAbvGr | KitchenAbvGr | GarageArea | Fence | SalePrice | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 1 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | True | False |
| 2 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 3 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | True | False |
| 4 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | True | False |
| 5 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | True | False |
| 6 | False | False | False | True | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1458 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | True | False |
1459 rows × 24 columns
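A more compact way to spot columns with many missing values is to count the nulls per column; a minimal sketch using the same data frame:
# Count missing values per column and list the worst offenders first.
null_counts = data.isnull().sum().sort_values(ascending=False)
print(null_counts.head())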
From the above step we observe that the columns Fence and LowQualFinSF have many null values, so we will not consider them when predicting prices. Let us drop these two columns.
data = data.drop(columns=["Fence", "LowQualFinSF"])
data.head()
| Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | LotShape | Utilities | OverallQual | OverallCond | ... | 1stFlrSF | 2ndFlrSF | BsmtFullBath | BsmtHalfBath | FullBath | HalfBath | BedroomAbvGr | KitchenAbvGr | GarageArea | SalePrice | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1461 | 20 | RH | 80.0 | 11622 | Pave | Reg | AllPub | 5 | 6 | ... | 896 | 0 | 0.0 | 0.0 | 1 | 0 | 2 | 1 | 730.0 | 169277.0525 |
| 1 | 1462 | 20 | RL | 81.0 | 14267 | Pave | IR1 | AllPub | 6 | 6 | ... | 1329 | 0 | 0.0 | 0.0 | 1 | 1 | 3 | 1 | 312.0 | 187758.3940 |
| 2 | 1463 | 60 | RL | 74.0 | 13830 | Pave | IR1 | AllPub | 5 | 5 | ... | 928 | 701 | 0.0 | 0.0 | 2 | 1 | 3 | 1 | 482.0 | 183583.6836 |
| 3 | 1464 | 60 | RL | 78.0 | 9978 | Pave | IR1 | AllPub | 6 | 6 | ... | 926 | 678 | 0.0 | 0.0 | 2 | 1 | 3 | 1 | 470.0 | 179317.4775 |
| 4 | 1465 | 120 | RL | 43.0 | 5005 | Pave | IR1 | AllPub | 8 | 5 | ... | 1280 | 0 | 0.0 | 0.0 | 2 | 0 | 2 | 1 | 506.0 | 150730.0800 |
5 rows × 22 columns
In the above step we dropped the columns with many null values. Let us now drop the rows that contain 'NA' values; such rows are of no use when predicting house SalePrices.
df=data.dropna(how='any',axis=0)
df
| Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | LotShape | Utilities | OverallQual | OverallCond | ... | 1stFlrSF | 2ndFlrSF | BsmtFullBath | BsmtHalfBath | FullBath | HalfBath | BedroomAbvGr | KitchenAbvGr | GarageArea | SalePrice | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1461 | 20 | RH | 80.0 | 11622 | Pave | Reg | AllPub | 5 | 6 | ... | 896 | 0 | 0.0 | 0.0 | 1 | 0 | 2 | 1 | 730.0 | 169277.0525 |
| 1 | 1462 | 20 | RL | 81.0 | 14267 | Pave | IR1 | AllPub | 6 | 6 | ... | 1329 | 0 | 0.0 | 0.0 | 1 | 1 | 3 | 1 | 312.0 | 187758.3940 |
| 2 | 1463 | 60 | RL | 74.0 | 13830 | Pave | IR1 | AllPub | 5 | 5 | ... | 928 | 701 | 0.0 | 0.0 | 2 | 1 | 3 | 1 | 482.0 | 183583.6836 |
| 3 | 1464 | 60 | RL | 78.0 | 9978 | Pave | IR1 | AllPub | 6 | 6 | ... | 926 | 678 | 0.0 | 0.0 | 2 | 1 | 3 | 1 | 470.0 | 179317.4775 |
| 4 | 1465 | 120 | RL | 43.0 | 5005 | Pave | IR1 | AllPub | 8 | 5 | ... | 1280 | 0 | 0.0 | 0.0 | 2 | 0 | 2 | 1 | 506.0 | 150730.0800 |
| 5 | 1466 | 60 | RL | 75.0 | 10000 | Pave | IR1 | AllPub | 6 | 5 | ... | 763 | 892 | 0.0 | 0.0 | 2 | 1 | 3 | 1 | 440.0 | 177150.9892 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1458 | 2919 | 60 | RL | 74.0 | 9627 | Pave | Reg | AllPub | 7 | 5 | ... | 996 | 1004 | 0.0 | 0.0 | 2 | 1 | 3 | 1 | 650.0 | 187741.8667 |
1226 rows × 22 columns
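As a quick sanity check, we can compare the shapes before and after dropping the rows; a minimal sketch:
print("Shape before dropping rows:", data.shape)  # (1459, 22)
print("Shape after dropping rows: ", df.shape)    # (1226, 22)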
We have finished cleaning the dataset. Let us look at its descriptive statistics (note that describe() is run on data, i.e. after the columns were dropped but before the NA rows were removed, hence the counts of 1459).
data.describe()
| Id | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | 1stFlrSF | 2ndFlrSF | BsmtFullBath | BsmtHalfBath | FullBath | HalfBath | BedroomAbvGr | KitchenAbvGr | GarageArea | SalePrice | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1459.000000 | 1459.000000 | 1232.000000 | 1459.000000 | 1459.000000 | 1459.000000 | 1459.000000 | 1459.000000 | 1459.000000 | 1457.000000 | 1457.000000 | 1459.000000 | 1459.000000 | 1459.000000 | 1459.000000 | 1458.000000 | 1459.000000 |
| mean | 2190.000000 | 57.378341 | 68.580357 | 9819.161069 | 6.078821 | 5.553804 | 1971.357779 | 1156.534613 | 325.967786 | 0.434454 | 0.065202 | 1.570939 | 0.377656 | 2.854010 | 1.042495 | 472.768861 | 179183.918243 |
| std | 421.321334 | 42.746880 | 22.376841 | 4955.517327 | 1.436812 | 1.113740 | 30.390071 | 398.165820 | 420.610226 | 0.530648 | 0.252468 | 0.555190 | 0.503017 | 0.829788 | 0.208472 | 217.048611 | 16518.303051 |
| min | 1461.000000 | 20.000000 | 21.000000 | 1470.000000 | 1.000000 | 1.000000 | 1879.000000 | 407.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 135751.318900 |
| 25% | 1825.500000 | 20.000000 | 58.000000 | 7391.000000 | 5.000000 | 5.000000 | 1953.000000 | 873.500000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 2.000000 | 1.000000 | 318.000000 | 168703.011200 |
| 50% | 2190.000000 | 50.000000 | 67.000000 | 9399.000000 | 6.000000 | 5.000000 | 1973.000000 | 1079.000000 | 0.000000 | 0.000000 | 0.000000 | 2.000000 | 0.000000 | 3.000000 | 1.000000 | 480.000000 | 179208.665700 |
| 75% | 2554.500000 | 70.000000 | 80.000000 | 11517.500000 | 7.000000 | 6.000000 | 2001.000000 | 1382.500000 | 676.000000 | 1.000000 | 0.000000 | 2.000000 | 1.000000 | 3.000000 | 1.000000 | 576.000000 | 186789.409400 |
| max | 2919.000000 | 190.000000 | 200.000000 | 56600.000000 | 10.000000 | 9.000000 | 2010.000000 | 5095.000000 | 1862.000000 | 3.000000 | 2.000000 | 4.000000 | 2.000000 | 6.000000 | 2.000000 | 1488.000000 | 281643.976100 |
Data visualization is the graphic representation of data. It involves producing images that communicate relationships among the represented data to viewers of the images.
Let us begin with visualization by importing libraries.
%matplotlib inline
import numpy as np
from scipy import stats, integrate
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(color_codes= True)
The most convenient way to examine a univariate distribution in seaborn is distplot(). It draws a histogram and fits a kernel density estimate (KDE).
plt.figure(figsize=(9,8))  # set the figure size
sns.distplot(df['SalePrice'], bins=50)  # plot a histogram of SalePrice with its kernel density estimate
print("Skewness: %f" % df['SalePrice'].skew())
print("Kurtosis: %f" % df['SalePrice'].kurt())
Skewness: 0.724931
Kurtosis: 2.650423
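SalePrice is mildly right-skewed (skewness ≈ 0.72). A common remedy, not applied in this notebook, is to model the log of the price instead; a minimal sketch:
log_price = np.log1p(df['SalePrice'])  # log(1 + SalePrice) compresses the right tail
print("Skewness after log transform: %f" % log_price.skew())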
cor_mat = df.corr()
cor_with_tar = cor_mat.sort_values(['SalePrice'], ascending=False)
print("The most relevant (numeric) features for the target are:")
cor_with_tar.SalePrice
The most relevant (numeric) features for the target are:
SalePrice       1.000000
BedroomAbvGr    0.813514
LotArea         0.686926
LotFrontage     0.499447
2ndFlrSF        0.344355
1stFlrSF        0.319452
FullBath        0.314649
GarageArea      0.245892
HalfBath        0.206062
KitchenAbvGr    0.169055
Id              0.138676
OverallQual     0.107261
YearBuilt       0.015349
BsmtHalfBath   -0.030641
BsmtFullBath   -0.052726
OverallCond    -0.075004
MSSubClass     -0.199732
Name: SalePrice, dtype: float64
Note that some of the features have quite a high correlation with the target; these features are really significant. Of these, the features with a correlation value greater than 0.5 are especially important, and some, like BedroomAbvGr, even more so. We will consider these features (i.e. BedroomAbvGr, LotArea, etc.) in more detail during the univariate and bivariate analysis in subsequent sections.
# Use a correlation heatmap to visualize features with high correlation.
cor_mat = df[['BedroomAbvGr', 'LotArea', 'LotFrontage', 'SalePrice']].corr()
mask = np.array(cor_mat)
mask[np.tril_indices_from(mask)] = False  # zero the lower triangle so the heatmap masks only the (nonzero) upper triangle
fig=plt.gcf()
fig.set_size_inches(30,12)
sns.heatmap(data=cor_mat,mask=mask,square=True,annot=True,cbar=True)
In this section we perform univariate analysis, concentrating on the features that matter most for the target, i.e. those with a high correlation with it.
For numeric features, 'distplot' and 'boxplot' are used to analyze the distributions.
def plot_cat(feature):
    ax = sns.countplot(data=df, x=feature)
plot_cat('OverallQual')
# GarageArea vs SalePrice
fig, ax = plt.subplots()
ax.scatter(x=df['GarageArea'], y=df['SalePrice'])
plt.ylabel('SalePrice')
plt.xlabel('GarageArea')
plt.show()
# We could try removing the points with GarageArea greater than 1200 as outliers.
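If we did want to drop those points, a minimal sketch of the filter (kept in a separate frame so df itself is unchanged; the 1200 sq ft cut-off is the hypothetical one suggested in the comment above):
df_trimmed = df[df['GarageArea'] <= 1200]  # hypothetical outlier cut-off
print(len(df) - len(df_trimmed), "rows would be removed")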
# LotFrontage vs SalePrice
fig, ax = plt.subplots()
ax.scatter(x=df['LotFrontage'], y=df['SalePrice'])
plt.ylabel('SalePrice')
plt.xlabel('LotFrontage')
plt.show()
# A similar outlier check could be applied to LotFrontage.
sns.jointplot(x="YearBuilt",y="SalePrice",data=df)
As can be observed above, most of the houses were built around the year 2000; the highest SalePrice is roughly 280,000, and houses in the 140,000 to 200,000 range are the most common.
fig = plt.figure(figsize=(20, 10))
fig1 = fig.add_subplot(221); sns.boxplot(x='OverallQual', y='SalePrice', data=df[['SalePrice', 'OverallQual']])
The SalePrice increases with the overall quality as expected.
fig4 = fig.add_subplot(224)
sns.scatterplot(x=df.GarageArea, y=df.SalePrice, hue=df.OverallQual, palette='Spectral')
fig5 = plt.figure(figsize=(20, 10))
fig7 = fig5.add_subplot(122)
sns.scatterplot(y=df.SalePrice, x=df['1stFlrSF'], hue=df.OverallQual, palette='YlOrRd')
There are four bathroom variables in our data set; of these, FullBath has the largest individual correlation with SalePrice.
fig = plt.figure(figsize=(20,10))
fig1 = fig.add_subplot(221); sns.regplot(x='FullBath', y='SalePrice', data=df)
plt.title('Correlation with SalePrice: {:6.4f}'.format(df.FullBath.corr(df['SalePrice'])))
fig2 = fig.add_subplot(222); sns.regplot(x='HalfBath', y='SalePrice', data=df);
plt.title('Correlation with SalePrice: {:6.4f}'.format(df.HalfBath.corr(df['SalePrice'])))
fig3 = fig.add_subplot(223); sns.regplot(x='BsmtFullBath', y='SalePrice', data=df)
plt.title('Correlation with SalePrice: {:6.4f}'.format(df.BsmtFullBath.corr(df['SalePrice'])))
fig4 = fig.add_subplot(224); sns.regplot(x='BsmtHalfBath', y='SalePrice', data=df)
plt.title('Correlation with SalePrice: {:6.4f}'.format(df.BsmtHalfBath.corr(df['SalePrice'])))
plt.show()
for col in df.columns:
print(col)
Id
MSSubClass
MSZoning
LotFrontage
LotArea
Street
LotShape
Utilities
OverallQual
OverallCond
YearBuilt
Heating
1stFlrSF
2ndFlrSF
BsmtFullBath
BsmtHalfBath
FullBath
HalfBath
BedroomAbvGr
KitchenAbvGr
GarageArea
SalePrice
results = df.dtypes  # check the data type of each column
print(results)
Id                int64
MSSubClass        int64
MSZoning         object
LotFrontage     float64
LotArea           int64
Street           object
LotShape         object
Utilities        object
OverallQual       int64
OverallCond       int64
YearBuilt         int64
Heating          object
1stFlrSF          int64
2ndFlrSF          int64
BsmtFullBath    float64
BsmtHalfBath    float64
FullBath          int64
HalfBath          int64
BedroomAbvGr      int64
KitchenAbvGr      int64
GarageArea      float64
SalePrice       float64
dtype: object
# Separate the continuous columns from the categorical ones before regression:
continuous_columns = ['Id','MSSubClass','LotFrontage','LotArea','OverallQual','OverallCond','YearBuilt','1stFlrSF','2ndFlrSF','BsmtFullBath','BsmtHalfBath',
'FullBath','HalfBath','BedroomAbvGr','GarageArea','SalePrice','KitchenAbvGr']
categorical_columns = [col for col in df.columns if col not in continuous_columns]
print(categorical_columns)
['MSZoning', 'Street', 'LotShape', 'Utilities', 'Heating']
# Convert the categorical columns to ordered pandas categoricals using as_ordered().
for col_name, col in df[categorical_columns].items():
df[col_name] = col.astype('category').cat.as_ordered()
print(df.dtypes)
Id                 int64
MSSubClass         int64
MSZoning        category
LotFrontage      float64
LotArea            int64
Street          category
LotShape        category
Utilities       category
OverallQual        int64
OverallCond        int64
YearBuilt          int64
Heating         category
1stFlrSF           int64
2ndFlrSF           int64
BsmtFullBath     float64
BsmtHalfBath     float64
FullBath           int64
HalfBath           int64
BedroomAbvGr       int64
KitchenAbvGr       int64
GarageArea       float64
SalePrice        float64
dtype: object
housing_data = df.copy()
housing_train_data = pd.get_dummies(housing_data, columns=['MSZoning','LotShape','Utilities','Heating','Street'], prefix = ['MSZoning','LotShape','Utilities','Heating','Street'])
print(housing_train_data.dtypes)
Id                    int64
MSSubClass            int64
LotFrontage         float64
LotArea               int64
OverallQual           int64
OverallCond           int64
YearBuilt             int64
1stFlrSF              int64
2ndFlrSF              int64
BsmtFullBath        float64
BsmtHalfBath        float64
FullBath              int64
HalfBath              int64
BedroomAbvGr          int64
KitchenAbvGr          int64
GarageArea          float64
SalePrice           float64
MSZoning_C (all)      uint8
MSZoning_FV           uint8
MSZoning_RH           uint8
MSZoning_RL           uint8
MSZoning_RM           uint8
LotShape_IR1          uint8
LotShape_IR2          uint8
LotShape_IR3          uint8
LotShape_Reg          uint8
Utilities_AllPub      uint8
Heating_GasA          uint8
Heating_GasW          uint8
Heating_Grav          uint8
Heating_Wall          uint8
Street_Grvl           uint8
Street_Pave           uint8
dtype: object
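One caveat worth noting: encoding every level produces perfectly collinear dummy columns (for example, Street_Grvl and Street_Pave always sum to 1), which ordinary least squares cannot uniquely resolve. pandas can drop one level per variable via drop_first; a minimal sketch, not applied here:
# Hypothetical variant keeping k-1 dummies per categorical variable.
housing_train_k1 = pd.get_dummies(housing_data,
                                  columns=['MSZoning', 'LotShape', 'Utilities', 'Heating', 'Street'],
                                  drop_first=True)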
print(housing_train_data)
        Id  MSSubClass  LotFrontage  LotArea  OverallQual  OverallCond  \
0     1461          20         80.0    11622            5            6
1     1462          20         81.0    14267            6            6
2     1463          60         74.0    13830            5            5
3     1464          60         78.0     9978            6            6
4     1465         120         43.0     5005            8            5
...    ...         ...          ...      ...          ...          ...

     YearBuilt  1stFlrSF  2ndFlrSF  BsmtFullBath  ...  Heating_GasW  \
0         1961       896         0           0.0  ...             0
1         1958      1329         0           0.0  ...             0
2         1997       928       701           0.0  ...             0
3         1998       926       678           0.0  ...             0
4         1992      1280         0           0.0  ...             0
...        ...       ...       ...           ...  ...           ...

      Heating_Grav  Heating_Wall  Street_Grvl  Street_Pave
0                0             0            0            1
1                0             0            0            1
2                0             0            0            1
3                0             0            0            1
4                0             0            0            1
...            ...           ...          ...          ...

[1226 rows x 33 columns]
# Plot the correlation matrix and observe the heatmap.
corrmatrix = housing_train_data.corr()
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmatrix, vmax=.8, square=True);
# Correlation heatmap annotated with the correlation values.
plt.figure(figsize=(100,80))
cor = housing_train_data.corr()
sns.set(font_scale=4)
sns.heatmap(cor, annot=True, cmap=plt.cm.Reds,annot_kws={"size": 45},vmax=.8, square=True)
plt.show()
# Take 0.2 as the threshold: columns whose absolute correlation with SalePrice exceeds it are selected for regression.
threshold = 0.2
filtered_data = abs(cor['SalePrice'])
result = filtered_data[filtered_data > threshold]
print(result)
LotFrontage     0.499447
LotArea         0.686926
1stFlrSF        0.319452
2ndFlrSF        0.344355
FullBath        0.314649
HalfBath        0.206062
BedroomAbvGr    0.813514
GarageArea      0.245892
SalePrice       1.000000
MSZoning_RL     0.278746
MSZoning_RM     0.217509
Name: SalePrice, dtype: float64
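The same selection can also be built programmatically from result rather than retyping the names; a minimal sketch (note that the hand-written list below additionally keeps MSSubClass, whose absolute correlation of about 0.1997 falls just under the threshold):
selected_columns = result.index.tolist()  # already includes 'SalePrice'
final_from_threshold = housing_train_data[selected_columns]
print(final_from_threshold.columns.tolist())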
# Form the final training data on which the regression will be run.
final_train_data = housing_train_data[['MSSubClass','LotFrontage','LotArea','1stFlrSF','2ndFlrSF','FullBath','HalfBath',
'BedroomAbvGr','GarageArea','SalePrice','MSZoning_RL','MSZoning_RM']]
print(final_train_data)
      MSSubClass  LotFrontage  LotArea  1stFlrSF  2ndFlrSF  FullBath  \
0             20         80.0    11622       896         0         1
1             20         81.0    14267      1329         0         1
2             60         74.0    13830       928       701         2
3             60         78.0     9978       926       678         2
4            120         43.0     5005      1280         0         2
...          ...          ...      ...       ...       ...       ...

      HalfBath  BedroomAbvGr  GarageArea    SalePrice  MSZoning_RL  MSZoning_RM
0            0             2       730.0  169277.0525            0            0
1            1             3       312.0  187758.3940            1            0
2            1             3       482.0  183583.6836            1            0
3            1             3       470.0  179317.4775            1            0
4            0             2       506.0  150730.0800            1            0
...        ...           ...         ...          ...          ...          ...

[1226 rows x 12 columns]
# X (feature matrix) and Y (target) required for regression:
X = final_train_data[['MSSubClass','LotFrontage','LotArea','1stFlrSF','2ndFlrSF','FullBath','HalfBath',
'BedroomAbvGr','GarageArea','MSZoning_RL','MSZoning_RM']]
Y = final_train_data[['SalePrice']]
pip install -U scikit-learn
Requirement already up-to-date: scikit-learn in c:\users\rmadh\anaconda3\lib\site-packages (0.22)
Requirement already satisfied, skipping upgrade: numpy>=1.11.0 in c:\users\rmadh\anaconda3\lib\site-packages (from scikit-learn) (1.16.4)
Requirement already satisfied, skipping upgrade: scipy>=0.17.0 in c:\users\rmadh\anaconda3\lib\site-packages (from scikit-learn) (1.2.1)
Requirement already satisfied, skipping upgrade: joblib>=0.11 in c:\users\rmadh\anaconda3\lib\site-packages (from scikit-learn) (0.13.2)
Note: you may need to restart the kernel to use updated packages.
#Splitting the data into Train and Test
from sklearn.model_selection import train_test_split, KFold, cross_val_score
xtrain, xtest, ytrain, ytest = train_test_split(X,Y,test_size=1/3, random_state=0)
# one third of the data is held out as the test set
print("xtrain : " + str(xtrain.shape))
print("xtest : " + str(xtest.shape))
print("ytrain : " + str(ytrain.shape))
print("ytest : " + str(ytest.shape))
print(xtrain)
print(ytrain)
xtrain : (817, 11)
xtest : (409, 11)
ytrain : (817, 1)
ytest : (409, 1)
      MSSubClass  LotFrontage  LotArea  1stFlrSF  2ndFlrSF  FullBath  \
858           60         64.0     7713       728       728         2
61            20         80.0    10032      1580         0         1
796           60         60.0     7500       957      1342         3
1339          30         50.0     9000       819         0         1
...          ...          ...      ...       ...       ...       ...

      HalfBath  BedroomAbvGr  GarageArea  MSZoning_RL  MSZoning_RM
858          1             3       400.0            1            0
61           1             3       440.0            1            0
796          1             5       482.0            1            0
1339         0             2         0.0            0            1
...        ...           ...         ...          ...          ...

[817 rows x 11 columns]
        SalePrice
858   178657.7941
61    179423.7516
796   205895.2546
1339  171022.2414
...           ...

[817 rows x 1 columns]
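KFold and cross_val_score were imported above but are not used yet; a minimal sketch of how they could estimate performance on the training split (assuming an ordinary LinearRegression, which this notebook has not introduced, and filling the remaining nulls first):
from sklearn.linear_model import LinearRegression

model = LinearRegression()
# 5-fold cross-validated R^2 on the training portion (nulls filled with 0 for the sketch).
scores = cross_val_score(model, xtrain.fillna(0), ytrain, cv=KFold(n_splits=5))
print("Mean R^2 across folds: %.4f" % scores.mean())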
xtrain.isnull() # check for null values
| MSSubClass | LotFrontage | LotArea | 1stFlrSF | 2ndFlrSF | FullBath | HalfBath | BedroomAbvGr | GarageArea | MSZoning_RL | MSZoning_RM | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 858 | False | False | False | False | False | False | False | False | False | False | False |
| 61 | False | False | False | False | False | False | False | False | False | False | False |
| 796 | False | False | False | False | False | False | False | False | False | False | False |
| 1339 | False | False | False | False | False | False | False | False | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 824 | False | False | False | False | False | False | False | False | False | False | False |
817 rows × 11 columns
ytrain.isnull()
| SalePrice | |
|---|---|
| 858 | False |
| 61 | False |
| 796 | False |
| 1339 | False |
| ... | ... |
| 824 | False |
817 rows × 1 columns
xtrain = xtrain.fillna(0)  # fill the null values with 0
print(xtrain)
MSSubClass LotFrontage LotArea 1stFlrSF 2ndFlrSF FullBath \
858 60 64.0 7713 728 728 2
61 20 80.0 10032 1580 0 1
796 60 60.0 7500 957 1342 3
1339 30 50.0 9000 819 0 1
1320 30 50.0 5633 844 0 1
933 80 80.0 9200 1042 0 2
1329 90 33.0 5976 624 624 2
822 160 21.0 2205 525 567 1
608 20 60.0 10122 869 0 1
802 60 98.0 12328 1164 1377 3
554 60 43.0 14565 832 825 2
817 20 63.0 9457 1422 0 1
588 90 72.0 11072 1728 0 2
844 120 34.0 3901 1302 0 1
330 90 81.0 9671 1248 1296 2
981 30 56.0 3153 967 0 1
434 20 70.0 11767 1560 0 1
869 60 59.0 9649 961 683 2
331 20 83.0 10143 1380 0 1
523 160 36.0 2448 764 862 2
835 60 134.0 16659 1582 570 2
946 20 65.0 6860 1008 0 1
918 60 80.0 8000 1403 1008 2
1404 180 35.0 3675 1072 0 1
1028 20 91.0 11375 1299 0 1
241 20 86.0 11194 1696 0 2
265 60 70.0 7703 833 897 2
88 50 52.0 6240 548 492 1
219 120 44.0 7390 1884 0 2
10 120 26.0 5858 1337 0 2
... ... ... ... ... ... ...
1400 20 73.0 16133 1176 0 1
795 60 63.0 7875 691 862 2
950 20 72.0 9770 922 0 1
1432 190 50.0 9000 1060 336 2
112 20 44.0 17485 1508 0 1
1009 80 120.0 13200 1497 797 3
533 60 41.0 12460 1037 1285 2
900 20 84.0 7476 1040 0 1
1169 60 63.0 12292 1102 1371 2
131 30 67.0 4853 1296 0 2
925 20 80.0 10800 1052 0 1
80 70 60.0 10560 868 602 1
1003 70 47.0 7755 1100 1164 2
635 30 60.0 6756 899 0 1
1008 50 40.0 5680 969 245 1
203 20 105.0 13693 2674 0 2
96 190 60.0 9900 1212 180 1
649 30 55.0 7111 1143 0 1
846 120 41.0 6289 1362 0 2
368 30 50.0 8635 1072 0 1
722 20 74.0 9627 1361 0 2
1312 70 62.0 9856 1007 1007 2
721 20 85.0 11058 1564 0 2
329 30 60.0 10800 725 0 1
1236 60 70.0 8304 941 896 2
909 160 24.0 2280 757 744 2
992 20 52.0 8626 968 0 1
1449 180 21.0 1470 630 0 1
659 50 60.0 5520 565 651 1
824 120 24.0 2304 1055 0 2
HalfBath BedroomAbvGr GarageArea MSZoning_RL MSZoning_RM
858 1 3 400.0 1 0
61 1 3 440.0 1 0
796 1 5 482.0 1 0
1339 0 2 0.0 0 1
1320 0 2 216.0 0 1
933 0 3 440.0 1 0
1329 0 2 0.0 0 1
822 1 3 264.0 0 1
608 0 1 390.0 1 0
802 1 4 729.0 1 0
554 1 3 483.0 1 0
817 0 3 576.0 1 0
588 0 6 576.0 1 0
844 1 1 631.0 1 0
330 2 6 907.0 1 0
981 0 2 180.0 0 1
434 0 2 313.0 1 0
869 1 3 460.0 1 0
331 1 3 364.0 1 0
523 1 2 474.0 1 0
835 1 3 728.0 1 0
946 0 3 308.0 1 0
918 1 4 570.0 1 0
1404 0 2 525.0 0 1
1028 0 3 494.0 1 0
241 0 3 972.0 1 0
265 1 3 528.0 1 0
88 0 2 624.0 0 1
219 0 2 649.0 1 0
10 0 2 511.0 0 0
... ... ... ... ... ...
1400 0 2 360.0 1 0
795 1 3 420.0 1 0
950 0 2 308.0 1 0
1432 0 4 0.0 0 0
112 0 1 572.0 1 0
1009 0 5 658.0 0 1
533 1 4 400.0 1 0
900 1 3 686.0 1 0
1169 1 4 675.0 1 0
131 0 2 260.0 1 0
925 0 3 311.0 1 0
80 1 2 624.0 0 1
1003 1 4 408.0 0 1
635 0 2 200.0 0 1
1008 0 2 216.0 0 1
203 1 2 762.0 1 0
96 0 3 0.0 1 0
649 0 2 288.0 1 0
846 0 2 460.0 1 0
368 0 2 1184.0 0 1
722 0 3 610.0 1 0
1312 0 5 624.0 0 1
721 0 3 814.0 1 0
329 1 1 320.0 1 0
1236 1 3 688.0 1 0
909 1 3 440.0 0 0
992 0 2 331.0 0 1
1449 0 1 0.0 0 1
659 0 3 355.0 0 1
824 0 2 319.0 1 0
[817 rows x 11 columns]
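Filling with 0 keeps the pipeline simple, but note that for a column like LotFrontage a 0 literally means a lot with no street frontage, which can bias the regression. A minimal alternative sketch, assuming a hypothetical helper impute_numeric_mean that is not part of the original code, would fill numeric gaps with column means instead:
def impute_numeric_mean(df):
    # hypothetical helper: fill NaNs in numeric columns with each column's mean
    out = df.copy()
    num_cols = out.select_dtypes(include="number").columns
    out[num_cols] = out[num_cols].fillna(out[num_cols].mean())
    return out
# xtrain = impute_numeric_mean(xtrain)  # would replace the fillna(0) step above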
np.isnan(xtrain) # confirm that no NaN values remain after filling
| MSSubClass | LotFrontage | LotArea | 1stFlrSF | 2ndFlrSF | FullBath | HalfBath | BedroomAbvGr | GarageArea | MSZoning_RL | MSZoning_RM | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 858 | False | False | False | False | False | False | False | False | False | False | False |
| 61 | False | False | False | False | False | False | False | False | False | False | False |
| 796 | False | False | False | False | False | False | False | False | False | False | False |
| 1339 | False | False | False | False | False | False | False | False | False | False | False |
| 1320 | False | False | False | False | False | False | False | False | False | False | False |
| 933 | False | False | False | False | False | False | False | False | False | False | False |
| 1329 | False | False | False | False | False | False | False | False | False | False | False |
| 822 | False | False | False | False | False | False | False | False | False | False | False |
| 608 | False | False | False | False | False | False | False | False | False | False | False |
| 802 | False | False | False | False | False | False | False | False | False | False | False |
| 554 | False | False | False | False | False | False | False | False | False | False | False |
| 817 | False | False | False | False | False | False | False | False | False | False | False |
| 588 | False | False | False | False | False | False | False | False | False | False | False |
| 844 | False | False | False | False | False | False | False | False | False | False | False |
| 330 | False | False | False | False | False | False | False | False | False | False | False |
| 981 | False | False | False | False | False | False | False | False | False | False | False |
| 434 | False | False | False | False | False | False | False | False | False | False | False |
| 869 | False | False | False | False | False | False | False | False | False | False | False |
| 331 | False | False | False | False | False | False | False | False | False | False | False |
| 523 | False | False | False | False | False | False | False | False | False | False | False |
| 835 | False | False | False | False | False | False | False | False | False | False | False |
| 946 | False | False | False | False | False | False | False | False | False | False | False |
| 918 | False | False | False | False | False | False | False | False | False | False | False |
| 1404 | False | False | False | False | False | False | False | False | False | False | False |
| 1028 | False | False | False | False | False | False | False | False | False | False | False |
| 241 | False | False | False | False | False | False | False | False | False | False | False |
| 265 | False | False | False | False | False | False | False | False | False | False | False |
| 88 | False | False | False | False | False | False | False | False | False | False | False |
| 219 | False | False | False | False | False | False | False | False | False | False | False |
| 10 | False | False | False | False | False | False | False | False | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1400 | False | False | False | False | False | False | False | False | False | False | False |
| 795 | False | False | False | False | False | False | False | False | False | False | False |
| 950 | False | False | False | False | False | False | False | False | False | False | False |
| 1432 | False | False | False | False | False | False | False | False | False | False | False |
| 112 | False | False | False | False | False | False | False | False | False | False | False |
| 1009 | False | False | False | False | False | False | False | False | False | False | False |
| 533 | False | False | False | False | False | False | False | False | False | False | False |
| 900 | False | False | False | False | False | False | False | False | False | False | False |
| 1169 | False | False | False | False | False | False | False | False | False | False | False |
| 131 | False | False | False | False | False | False | False | False | False | False | False |
| 925 | False | False | False | False | False | False | False | False | False | False | False |
| 80 | False | False | False | False | False | False | False | False | False | False | False |
| 1003 | False | False | False | False | False | False | False | False | False | False | False |
| 635 | False | False | False | False | False | False | False | False | False | False | False |
| 1008 | False | False | False | False | False | False | False | False | False | False | False |
| 203 | False | False | False | False | False | False | False | False | False | False | False |
| 96 | False | False | False | False | False | False | False | False | False | False | False |
| 649 | False | False | False | False | False | False | False | False | False | False | False |
| 846 | False | False | False | False | False | False | False | False | False | False | False |
| 368 | False | False | False | False | False | False | False | False | False | False | False |
| 722 | False | False | False | False | False | False | False | False | False | False | False |
| 1312 | False | False | False | False | False | False | False | False | False | False | False |
| 721 | False | False | False | False | False | False | False | False | False | False | False |
| 329 | False | False | False | False | False | False | False | False | False | False | False |
| 1236 | False | False | False | False | False | False | False | False | False | False | False |
| 909 | False | False | False | False | False | False | False | False | False | False | False |
| 992 | False | False | False | False | False | False | False | False | False | False | False |
| 1449 | False | False | False | False | False | False | False | False | False | False | False |
| 659 | False | False | False | False | False | False | False | False | False | False | False |
| 824 | False | False | False | False | False | False | False | False | False | False | False |
817 rows × 11 columns
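Scanning an 817-row True/False table by eye is error-prone. A one-line summary gives the same assurance, since isnull().sum().sum() counts the missing cells across the whole frame (a quick sketch, not an output from the original run):
print(xtrain.isnull().sum().sum()) # total missing cells; expected to be 0 after fillna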
xtest = xtest.fillna(0) # fill null values in the test features with 0
ytest = ytest.fillna(0) # and in the test target
print(xtest)
MSSubClass LotFrontage LotArea 1stFlrSF 2ndFlrSF FullBath \
1285 20 66.0 12778 1008 0 1
5 60 75.0 10000 763 892 2
1089 20 128.0 39290 5095 0 2
141 50 50.0 9140 727 380 1
1394 20 70.0 8810 1390 0 2
1124 20 80.0 10000 1392 0 2
1293 190 94.0 22136 1392 1248 2
1188 20 65.0 7514 1387 0 1
58 20 70.0 8197 1480 0 2
628 30 60.0 4800 672 0 1
494 60 75.0 9743 720 588 1
1335 30 50.0 11672 816 0 1
19 20 110.0 14300 2696 0 2
62 50 60.0 8382 832 505 1
1045 20 71.0 9178 1651 0 2
189 20 60.0 7038 864 0 1
321 20 53.0 5470 792 0 1
737 30 60.0 7290 1424 0 2
1353 90 75.0 8235 1949 0 2
984 50 50.0 5925 761 380 1
498 90 60.0 9000 896 896 2
1231 20 74.0 7733 1142 0 1
1209 60 96.0 11631 1052 1321 2
227 60 62.0 7984 868 762 2
1291 30 118.0 21299 2039 0 1
178 20 80.0 10928 1671 0 2
426 20 70.0 8750 1417 0 2
590 20 74.0 7785 1014 0 1
1172 120 60.0 8147 1714 0 2
609 50 45.0 7506 747 412 1
... ... ... ... ... ... ...
1343 20 100.0 21370 1640 0 1
920 20 130.0 11700 1295 0 1
430 180 35.0 3675 1072 0 1
1065 20 45.0 8982 1040 0 1
312 20 80.0 10400 1200 0 1
20 60 105.0 13650 1687 563 2
527 60 80.0 9024 813 702 2
31 30 70.0 9800 1012 0 1
1408 50 62.0 8707 1093 576 1
515 20 92.0 10845 1575 0 2
1430 50 75.0 9060 967 671 2
723 20 75.0 9825 1092 0 1
304 20 90.0 9900 1691 0 1
764 90 76.0 10260 936 936 2
733 50 57.0 8050 929 208 1
1305 50 60.0 12144 910 611 1
740 50 63.0 15564 676 588 1
1012 190 52.0 4330 838 477 2
1138 20 85.0 12633 1824 0 2
316 30 80.0 8480 793 0 1
431 20 64.0 6410 876 0 1
1323 190 50.0 6000 960 0 1
115 60 87.0 12361 964 838 2
601 20 75.0 10425 1104 0 1
691 30 85.0 19550 1518 0 1
966 70 60.0 10800 751 783 1
369 70 57.0 8094 1048 720 2
631 50 63.0 7628 1095 561 2
690 50 50.0 5190 617 462 1
1310 20 70.0 7000 864 0 1
HalfBath BedroomAbvGr GarageArea MSZoning_RL MSZoning_RM
1285 0 2 280.0 1 0
5 1 3 440.0 1 0
1089 1 2 1154.0 1 0
141 0 2 625.0 0 1
1394 0 3 545.0 1 0
1124 0 3 650.0 1 0
1293 1 5 1008.0 1 0
1188 0 3 300.0 1 0
58 0 3 620.0 1 0
628 0 2 256.0 0 1
494 1 3 484.0 1 0
1335 0 2 210.0 1 0
19 1 3 958.0 1 0
62 0 3 263.0 1 0
1045 0 3 870.0 1 0
189 0 3 576.0 1 0
321 0 2 366.0 1 0
737 0 2 312.0 1 0
1353 0 6 586.0 1 0
984 0 3 252.0 0 1
498 2 4 480.0 1 0
1231 1 3 0.0 1 0
1209 1 4 632.0 1 0
227 1 3 436.0 1 0
1291 1 3 791.0 1 0
178 0 3 484.0 1 0
426 0 3 511.0 1 0
590 0 2 267.0 1 0
1172 0 2 517.0 1 0
609 0 3 288.0 1 0
... ... ... ... ... ...
1343 0 3 394.0 1 0
920 1 3 528.0 1 0
430 0 2 525.0 0 1
1065 1 3 748.0 1 0
312 0 3 294.0 1 0
20 1 3 756.0 1 0
527 1 3 393.0 1 0
31 0 2 429.0 0 0
1408 1 4 288.0 1 0
515 0 2 732.0 1 0
1430 0 4 384.0 0 1
723 0 3 264.0 1 0
304 0 2 550.0 1 0
764 2 4 484.0 1 0
733 1 4 0.0 1 0
1305 1 3 597.0 1 0
740 0 3 400.0 1 0
1012 0 3 436.0 0 1
1138 0 3 932.0 1 0
316 0 2 240.0 1 0
431 0 3 320.0 1 0
1323 0 3 576.0 0 1
115 1 3 1017.0 1 0
601 0 3 384.0 1 0
691 0 2 0.0 1 0
966 1 3 0.0 1 0
369 0 3 576.0 0 1
631 0 2 440.0 0 1
690 0 2 249.0 1 0
1310 0 3 336.0 1 0
[409 rows x 11 columns]
print(ytest)
        SalePrice
1285  169618.3499
5     177150.9892
1089  230841.3386
141   162182.5962
1394  177294.8389
1124  188683.1606
1293  234492.2483
1188  181373.6165
58    171392.8310
628   159856.2339
494   180650.3424
1335  174071.0548
19    187823.3393
62    171756.9181
1045  183750.7554
189   177743.7994
321   160277.1426
737   160337.0792
1353  218200.6421
984   172929.1564
498   192463.9700
1231  172965.4765
1209  199437.2328
227   179605.5637
1291  211817.7028
178   184294.5760
426   180008.1885
590   160206.3656
1172  168238.6122
609   175142.9481
...           ...
1343  205328.0782
920   183189.6853
430   152324.9436
1065  174525.8794
312   186570.1294
20    186544.1143
527   180340.2154
31    163481.5015
1408  192577.7961
515   169543.2406
1430  192167.6211
723   178601.9365
304   164575.9823
764   200468.1611
733   188384.5484
1305  190485.6341
740   186581.8309
1012  166475.4576
1138  190550.3279
316   166200.9381
431   169878.5160
1323  175079.3288
115   182902.4203
601   179782.7595
691   181150.5517
966   184733.1285
369   176507.3702
631   162107.1687
690   152889.5210
1310  172627.7969
[409 rows x 1 columns]
from sklearn import linear_model
model_reg = linear_model.LinearRegression() # create a linear regression model
ytrain = ytrain.fillna(0)
print(ytrain)
        SalePrice
858   178657.7941
61    179423.7516
796   205895.2546
1339  171022.2414
1320  162186.0712
933   179374.4820
1329  168385.5711
822   162293.3761
608   157054.2539
802   201016.1711
554   195664.6861
817   184299.8381
588   228617.9683
844   145708.7643
330   221648.3920
981   156407.6592
434   169355.2025
869   180258.1312
331   180539.8808
523   157437.1928
835   195158.9726
946   182503.5201
918   191393.6086
1404  161647.3298
1028  189179.4282
241   185922.9668
265   181262.3300
88    157580.1776
219   160741.0986
10    157933.2795
...           ...
1400  188374.9367
795   180081.5088
950   166115.4912
1432  198678.8941
112   162015.3185
1009  212693.5030
533   198168.4901
900   177086.4764
1169  201842.9987
131   154850.5084
925   179208.6657
80    167186.9958
1003  189806.5466
635   164810.6095
1008  163590.6772
203   170935.8592
96    174744.4003
649   163299.4780
846   161474.5349
368   166505.9840
722   180422.0500
1312  209219.8448
721   178818.7426
329   160805.5846
1236  180718.5817
909   170175.2268
992   164968.9479
1449  137402.5699
659   170129.5314
824   151422.1169
[817 rows x 1 columns]
model_reg.fit(xtrain, ytrain) # fit the regression model to the training data
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
model_reg.score(xtrain, ytrain) # R-squared of the model on the training data
0.9512519845964049
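Note that 0.95 is the R-squared on the training data itself, which can be optimistic. A quick check on the held-out split, reusing the xtest and ytest frames prepared above, is one line (a sketch, not an output from the original run):
print(model_reg.score(xtest, ytest)) # R-squared on the unseen test split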
model_reg.coef_
array([[ 2.62367221e+00, -4.61232867e+00, 2.00775613e+00,
-2.31956830e-01, -1.79324640e-01, -1.13598669e+02,
-1.94142292e+02, 1.34823355e+04, 4.34307634e-01,
1.29141441e+02, -2.39764954e+02]])
model_reg.intercept_
array([121507.58127551])
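The raw coefficient array is hard to map back to features. Pairing each coefficient with its column name makes the dominant BedroomAbvGr term obvious (a small sketch using the pandas import from earlier; coef_table is a hypothetical name):
coef_table = pd.Series(model_reg.coef_.ravel(), index=xtrain.columns) # label each coefficient by its feature
print(coef_table.sort_values(ascending=False))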
predictions = model_reg.predict(xtest) # predict sale prices for the test set; the exports below let us inspect them in Excel
predictionsdf = pd.DataFrame(predictions)
predictionsdf.to_excel("C:/Users/rmadh/OneDrive/Desktop/Lecture_Notes/IDS/Project/output1.xlsx")
ytest.to_excel("C:/Users/rmadh/OneDrive/Desktop/Lecture_Notes/IDS/Project/output2.xlsx") # a separate file, otherwise this write would overwrite the predictions saved above
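Both frames could also live in a single workbook on separate sheets, which keeps predictions and actual prices side by side for comparison (a sketch using pandas' ExcelWriter with the same project path; the sheet names are illustrative):
with pd.ExcelWriter("C:/Users/rmadh/OneDrive/Desktop/Lecture_Notes/IDS/Project/output1.xlsx") as writer:
    predictionsdf.to_excel(writer, sheet_name="predictions") # model output
    ytest.to_excel(writer, sheet_name="actuals") # ground truth for comparison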
from sklearn import metrics
print("Evaluation of the prediction model : Mean Absolute error")
print(metrics.mean_absolute_error(ytest,predictionsdf))
print("Mean Squared error")
print(metrics.mean_squared_error(ytest,predictionsdf))
Evaluation of the prediction model : Mean Absolute error
2915.135547907341
Mean Squared error
12587212.795744259
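The mean squared error above is in squared dollars, which is hard to read directly. Taking its square root gives an error in the same units as SalePrice (a small sketch using the numpy import from earlier):
rmse = np.sqrt(metrics.mean_squared_error(ytest, predictionsdf)) # RMSE, in the same dollar units as SalePrice
print("Root Mean Squared Error :", rmse)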
We can hereby conclude that the number of bedrooms above ground (BedroomAbvGr) carries the largest coefficient in the model (about 1.35 × 10^4), so it influences the predicted SalePrice the most. Multiple linear regression also fits this dataset well: the R-squared value on the training data is about 0.95, which indicates a strong fit.